Machine Learning for DDoS attack detection¶

Ayman BEN HAJJAJ & Jules RUBIN

Description¶

This study is the final project of the Machine Learning II course at EFREI Paris (Master 1 Data Science & AI, 2023). The project aims to detect DDoS attacks in network traffic using machine learning. We use the CIC-DDoS2019 dataset, which contains 78 features, about 430K rows, and 18 types of attacks. The taxonomy of attacks present in the dataset is described in the research paper DDoS Evaluation Dataset (CIC-DDoS2019).

Taxonomy of attacks present in the dataset:

Taxonomy of attacks

Since we are free to choose which attack types to detect, we will focus on three of them:

  • UDP: UDP flood attack
  • Syn: SYN flood attack
  • DrDoS DNS: DNS amplification attack

We will also provide a method to distinguish benign traffic from malicious traffic.

Table of contents

  • Machine Learning for DDoS attack detection
    • Description
    • Preprocessing
      • Importing libraries and dataset
      • Data cleaning
        • Dealing with missing values
        • Dealing with duplicates
        • Dealing with categorical variables
      • Balancing the dataset
      • Handling outliers
    • Data visualization
      • Distribution according to the attack type
      • Correlation matrix
    • Feature Dimensionality Reduction
      • PCA
      • Kernel PCA
      • t-SNE
    • Supervised learning for multiclassification
      • LDA
      • QDA
      • PCA + LDA
      • PCA + QDA
      • K-PCA + LDA
      • K-PCA + QDA
      • t-SNE + LDA
      • t-SNE + QDA
    • Unsupervised learning for clustering
      • k-Means
      • GMM
      • DBSCAN
      • Hierarchical clustering
    • Other approaches
      • Decision tree
      • K-Nearest Neighbors
      • Random Forest
    • Conclusion

Preprocessing¶

Importing libraries and dataset¶

In [1]:
# making the necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

import warnings
warnings.filterwarnings('ignore')
In [2]:
df = pd.DataFrame()
df = pd.concat([df, pd.read_parquet('data/Syn-training.parquet')])
df = pd.concat([df, pd.read_parquet('data/DNS-testing.parquet')])
df = pd.concat([df, pd.read_parquet('data/UDP-training.parquet')])
df['Label'].value_counts()
Out[2]:
Syn          43302
Benign       32901
UDP          14792
DrDoS_DNS     3669
MSSQL          145
Name: Label, dtype: int64
In [3]:
# We can remove the MSSQL data as it is not required for our analysis
df = df[df['Label'] != 'MSSQL']
In [4]:
df['Label'].value_counts()
Out[4]:
Syn          43302
Benign       32901
UDP          14792
DrDoS_DNS     3669
Name: Label, dtype: int64

Data cleaning¶

Dealing with missing values¶

In [5]:
# count the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)
Out[5]:
Protocol                0
CWE Flag Count          0
Fwd Avg Packets/Bulk    0
Fwd Avg Bytes/Bulk      0
Avg Bwd Segment Size    0
                       ..
Bwd IAT Total           0
Fwd IAT Min             0
Fwd IAT Max             0
Fwd IAT Std             0
Label                   0
Length: 78, dtype: int64

Dealing with duplicates¶

In [6]:
# check for duplicate rows
df.duplicated().sum()
Out[6]:
194
In [7]:
# remove duplicate rows
df.drop_duplicates(inplace=True)

We do not have any missing values in the dataset. However, there were 194 duplicate rows, which we have removed.

Dealing with categorical variables¶

In [8]:
# count the number of unique values in each column
df.nunique().sort_values(ascending=True).head(25)
Out[8]:
Bwd Avg Bulk Rate          1
Bwd Avg Packets/Bulk       1
Bwd Avg Bytes/Bulk         1
Fwd Avg Bulk Rate          1
Fwd Avg Packets/Bulk       1
Fwd Avg Bytes/Bulk         1
ECE Flag Count             1
Fwd URG Flags              1
Bwd PSH Flags              1
Bwd URG Flags              1
PSH Flag Count             1
FIN Flag Count             1
URG Flag Count             2
Fwd PSH Flags              2
RST Flag Count             2
ACK Flag Count             2
CWE Flag Count             2
SYN Flag Count             2
Protocol                   3
Label                      4
Down/Up Ratio             15
Bwd IAT Min              105
Bwd Packet Length Min    195
Fwd Act Data Packets     212
Total Fwd Packets        263
dtype: int64

As some columns have only one unique value, they do not bring any information, so we remove them. We can also see that some columns are categorical variables; we convert them to numerical variables using one-hot encoding.

In [9]:
one_value_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(one_value_cols, axis=1)
In [10]:
three_value_cols = [col for col in df.columns if df[col].nunique() <= 3]
# One Hot Encoding
df = pd.get_dummies(df, columns=three_value_cols)

Balancing the dataset¶

In [11]:
df['Label'].value_counts()
Out[11]:
Syn          43302
Benign       32707
UDP          14792
DrDoS_DNS     3669
Name: Label, dtype: int64
In [12]:
# As the dataset is imbalanced, we balance it by sampling 3500 rows from each class
balanced = pd.DataFrame()
balanced = pd.concat([balanced, df[df['Label'] == 'Syn'].sample(n=3500)])
balanced = pd.concat([balanced, df[df['Label'] == 'DrDoS_DNS'].sample(n=3500)])
balanced = pd.concat([balanced, df[df['Label'] == 'UDP'].sample(n=3500)])
balanced = pd.concat([balanced, df[df['Label'] == 'Benign'].sample(n=3500)])
df = balanced.copy()
# free up memory
del balanced

Handling outliers¶

In [13]:
df.describe()
Out[13]:
Flow Duration Total Fwd Packets Total Backward Packets Fwd Packets Length Total Bwd Packets Length Total Fwd Packet Length Max Fwd Packet Length Min Fwd Packet Length Mean Fwd Packet Length Std Bwd Packet Length Max ... SYN Flag Count_0 SYN Flag Count_1 RST Flag Count_0 RST Flag Count_1 ACK Flag Count_0 ACK Flag Count_1 URG Flag Count_0 URG Flag Count_1 CWE Flag Count_0 CWE Flag Count_1
count 1.400000e+04 14000.000000 14000.000000 14000.000000 1.400000e+04 14000.000000 14000.000000 14000.000000 14000.000000 14000.000000 ... 14000.000000 14000.000000 14000.000000 14000.000000 14000.000000 14000.000000 14000.000000 14000.000000 14000.000000 14000.000000
mean 1.384735e+07 7.427571 3.811000 1574.796875 2.425408e+03 396.200653 340.513275 354.114136 19.885420 93.736641 ... 0.999500 0.000500 0.970500 0.029500 0.695714 0.304286 0.897286 0.102714 0.943786 0.056214
std 2.724687e+07 37.479830 60.160449 5839.451172 1.047203e+05 511.125458 485.877258 482.500702 69.611465 434.197937 ... 0.022356 0.022356 0.169209 0.169209 0.460121 0.460121 0.303596 0.303596 0.230343 0.230343
min 1.000000e+00 1.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4.900000e+01 2.000000 0.000000 60.000000 0.000000e+00 6.000000 6.000000 6.000000 0.000000 0.000000 ... 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000
50% 1.053630e+05 4.000000 0.000000 750.000000 0.000000e+00 349.000000 47.000000 125.769230 0.000000 0.000000 ... 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000
75% 6.026834e+06 6.000000 2.000000 2088.000000 2.400000e+01 424.000000 330.000000 359.500000 22.516661 6.000000 ... 1.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000
max 1.199912e+08 3890.000000 6706.000000 130324.000000 1.199727e+07 3495.000000 2033.000000 2033.000000 1217.637817 3560.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 73 columns

In [14]:
# check for the outliers
fig, ax = plt.subplots(4, 2, figsize=(15, 15))
# columns to check for outliers (8 columns to fill the 4x2 grid)
outliers_col = ['Flow Duration', 'Total Fwd Packets', 'Total Backward Packets', 'Fwd Packet Length Max',
                'Bwd Packet Length Max', 'Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean']

# draw a boxplot for each column
for i in range(4):
    for j in range(2):
        col = outliers_col[i * 2 + j]
        sns.boxplot(x=df[col], ax=ax[i, j])
plt.show()
In [15]:
# remove the outliers using IsolationForest
from sklearn.ensemble import IsolationForest

# create an instance of the IsolationForest class
iso = IsolationForest(n_estimators=1000, max_samples='auto', contamination=0.05, max_features=1.0,
                      bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
# fit the model
yhat = iso.fit_predict(df.drop('Label', axis=1))
# select all rows that are not outliers
mask = yhat != -1
df = df[mask]
df.shape
Out[15]:
(13300, 74)
In [16]:
df['Label'].value_counts()
Out[16]:
UDP          3500
Syn          3495
DrDoS_DNS    3488
Benign       2817
Name: Label, dtype: int64

We can see that there are few outliers among the attack classes, but around 20% of the benign traffic was flagged as outliers. These rows have been removed.

In [17]:
# move the Label column to the first position
df = pd.concat([df['Label'], df.drop('Label', axis=1)], axis=1)
df
Out[17]:
Label Flow Duration Total Fwd Packets Total Backward Packets Fwd Packets Length Total Bwd Packets Length Total Fwd Packet Length Max Fwd Packet Length Min Fwd Packet Length Mean Fwd Packet Length Std ... SYN Flag Count_0 SYN Flag Count_1 RST Flag Count_0 RST Flag Count_1 ACK Flag Count_0 ACK Flag Count_1 URG Flag Count_0 URG Flag Count_1 CWE Flag Count_0 CWE Flag Count_1
32769 Syn 65383745 10 2 60.0 12.0 6.0 6.0 6.00 0.000000 ... 1 0 1 0 0 1 1 0 1 0
27006 Syn 60845011 10 0 60.0 0.0 6.0 6.0 6.00 0.000000 ... 1 0 1 0 0 1 1 0 1 0
42507 Syn 59823050 12 4 72.0 24.0 6.0 6.0 6.00 0.000000 ... 1 0 1 0 0 1 1 0 1 0
7722 Syn 42662938 10 6 60.0 36.0 6.0 6.0 6.00 0.000000 ... 1 0 1 0 0 1 1 0 1 0
40126 Syn 27868675 6 2 36.0 12.0 6.0 6.0 6.00 0.000000 ... 1 0 1 0 0 1 1 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57191 Benign 25059 2 2 64.0 204.0 32.0 32.0 32.00 0.000000 ... 1 0 1 0 1 0 1 0 1 0
59939 Benign 7321085 2 8 0.0 0.0 0.0 0.0 0.00 0.000000 ... 1 0 1 0 0 1 1 0 1 0
16875 Benign 1 2 0 0.0 0.0 0.0 0.0 0.00 0.000000 ... 1 0 1 0 1 0 0 1 0 1
52300 Benign 634 4 0 129.0 0.0 46.0 6.0 32.25 18.874586 ... 1 0 0 1 1 0 0 1 1 0
51947 Benign 20644 2 2 64.0 120.0 32.0 32.0 32.00 0.000000 ... 1 0 1 0 1 0 1 0 1 0

13300 rows × 74 columns

Data visualization¶

Distribution according to the attack type¶

In [18]:
# Plot the distribution of the Flow Duration for each class
fig, ax = plt.subplots(1, 4, figsize=(20, 5))
sns.distplot(df[df['Label'] == 'Syn']['Flow Duration'], ax=ax[0], color='r')
ax[0].set_title('Syn')
sns.distplot(df[df['Label'] == 'DrDoS_DNS']['Flow Duration'], ax=ax[1], color='b')
ax[1].set_title('DrDoS_DNS')
sns.distplot(df[df['Label'] == 'UDP']['Flow Duration'], ax=ax[2], color='g')
ax[2].set_title('UDP')
sns.distplot(df[df['Label'] == 'Benign']['Flow Duration'], ax=ax[3], color='y')
ax[3].set_title('Benign')
plt.show()
In [19]:
# Plot the distribution of the Total Fwd Packets for each class
fig, ax = plt.subplots(1, 4, figsize=(20, 5))
sns.distplot(df[df['Label'] == 'Syn']['Total Fwd Packets'], ax=ax[0], color='r')
ax[0].set_title('Syn')
sns.distplot(df[df['Label'] == 'DrDoS_DNS']['Total Fwd Packets'], ax=ax[1], color='b')
ax[1].set_title('DrDoS_DNS')
sns.distplot(df[df['Label'] == 'UDP']['Total Fwd Packets'], ax=ax[2], color='g')
ax[2].set_title('UDP')
sns.distplot(df[df['Label'] == 'Benign']['Total Fwd Packets'], ax=ax[3], color='y')
ax[3].set_title('Benign')
plt.show()
In [20]:
# Plot the distribution of the Total Backward Packets for each class
fig, ax = plt.subplots(1, 4, figsize=(20, 5))
sns.distplot(df[df['Label'] == 'Syn']['Total Backward Packets'], ax=ax[0], color='r')
ax[0].set_title('Syn')
sns.distplot(df[df['Label'] == 'DrDoS_DNS']['Total Backward Packets'], ax=ax[1], color='b')
ax[1].set_title('DrDoS_DNS')
sns.distplot(df[df['Label'] == 'UDP']['Total Backward Packets'], ax=ax[2], color='g')
ax[2].set_title('UDP')
sns.distplot(df[df['Label'] == 'Benign']['Total Backward Packets'], ax=ax[3], color='y')
ax[3].set_title('Benign')
plt.show()
In [21]:
# Plot the distribution of the Flow Bytes/s for each class
fig, ax = plt.subplots(1, 4, figsize=(20, 5))
sns.distplot(df[df['Label'] == 'Syn']['Flow Bytes/s'], ax=ax[0], color='r')
ax[0].set_title('Syn')
sns.distplot(df[df['Label'] == 'DrDoS_DNS']['Flow Bytes/s'], ax=ax[1], color='b')
ax[1].set_title('DrDoS_DNS')
sns.distplot(df[df['Label'] == 'UDP']['Flow Bytes/s'], ax=ax[2], color='g') 
ax[2].set_title('UDP')
sns.distplot(df[df['Label'] == 'Benign']['Flow Bytes/s'], ax=ax[3], color='y')
ax[3].set_title('Benign')
plt.show()
In [22]:
# Plot the distribution of the Flow Packets/s for each class
fig, ax = plt.subplots(1, 4, figsize=(20, 5))
sns.distplot(df[df['Label'] == 'Syn']['Flow Packets/s'], ax=ax[0], color='r')
ax[0].set_title('Syn')
sns.distplot(df[df['Label'] == 'DrDoS_DNS']['Flow Packets/s'], ax=ax[1], color='b')
ax[1].set_title('DrDoS_DNS')
sns.distplot(df[df['Label'] == 'UDP']['Flow Packets/s'], ax=ax[2], color='g')
ax[2].set_title('UDP')
sns.distplot(df[df['Label'] == 'Benign']['Flow Packets/s'], ax=ax[3], color='y')
ax[3].set_title('Benign')
plt.show()

We can see that the distribution of the Syn attack is very different from the other classes: the number of packets sent and the flow duration can be very high. The DNS attack also has characteristics that clearly differ from benign traffic; its Flow Bytes/s and Flow Packets/s values are very high.

These visualizations help us to have a better understanding of the characteristics of the attacks.

Correlation matrix¶

In [23]:
# Plot the correlation matrix
corr = df.drop('Label', axis=1).corr()
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=False, cmap='coolwarm')
plt.show()

We can see high correlations between some features, such as the Packet Length Min/Max/Mean and Fwd Packet Length Min/Max/Mean columns, or Idle Mean/Std/Min/Max and Flow IAT Mean/Std/Min/Max. One-hot encoding has also created columns that are perfectly anti-correlated, as we can see with the last columns.

These correlations will be handled by PCA after scaling.
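To make that claim concrete, correlated pairs can be listed directly from the correlation matrix. A minimal sketch, using a toy frame in place of the notebook's `df` (column names here are illustrative, not the real features):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's numeric columns (hypothetical data).
rng = np.random.default_rng(42)
a = rng.normal(size=200)
flag = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    'c': rng.normal(size=200),                      # independent column
    'Flag_0': flag,
    'Flag_1': 1 - flag,                             # perfectly anti-correlated dummy
})

corr = demo.corr().abs()
# keep the upper triangle so each pair appears only once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
print(high_pairs[high_pairs > 0.95])
```

On the real data, the same stacking trick would surface the redundant packet-length features and the anti-correlated one-hot dummies.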

Feature Dimensionality Reduction¶

PCA¶

In [24]:
# we standardize the features
from sklearn.preprocessing import StandardScaler
df.reset_index(inplace=True, drop=True)
# separate the features from the labels
X = df.drop('Label', axis=1)
y = df['Label']

# standardize the features
X = StandardScaler().fit_transform(X)
In [25]:
# find the optimal number of components
from sklearn.decomposition import PCA
df_pca = PCA().fit(X)
plt.plot(np.cumsum(df_pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.title('Explained variance vs number of components')
plt.show()

We see that to get 80% of the explained variance, we need to keep 12 components.
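scikit-learn can also pick this threshold automatically: passing a float between 0 and 1 as `n_components` keeps just enough components to reach that fraction of explained variance. A small sketch on synthetic data (`X_demo` is a stand-in for the notebook's scaled `X`):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data standing in for the notebook's standardized X.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))

# Manual selection: smallest k whose cumulative explained variance reaches 80%
full = PCA().fit(X_demo)
cumvar = np.cumsum(full.explained_variance_ratio_)
k = int(np.argmax(cumvar >= 0.80)) + 1

# Shortcut: let PCA derive k from the variance threshold directly
auto = PCA(n_components=0.80).fit(X_demo)
print(k, auto.n_components_)
```

Both routes give the same k, so the elbow plot and the threshold shortcut can be used interchangeably.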

In [26]:
pca = PCA(n_components=12)
principalComponents = pca.fit_transform(X)

# create a dataframe with the principal components
df_pca = pd.DataFrame(data=principalComponents, columns=['PC' + str(i) for i in range(1, 13)])

# concatenate the labels to the dataframe
df_pca = pd.concat([df_pca, df[['Label']]], axis=1)

# print the first 5 rows of the dataframe
df_pca.head()
Out[26]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 Label
0 7.126359 -2.049307 -0.218012 -0.075015 -0.457728 -2.197801 -2.642833 1.013314 -0.526303 -0.034664 -0.036337 0.296247 Syn
1 6.535007 -2.192465 -0.668037 0.077000 -0.628068 -2.609685 -2.858930 1.152035 -0.290998 -0.011859 -0.058677 0.169141 Syn
2 9.015369 -2.608999 1.073945 -0.248945 1.312070 2.274186 2.698763 -5.269964 1.029990 0.089698 0.113237 -0.478708 Syn
3 8.304637 -1.874303 1.015959 -0.418661 0.960101 1.781646 1.773516 -3.592095 0.628659 0.049796 0.030184 -0.316561 Syn
4 5.150955 -0.908257 -0.535669 -0.256662 -0.744337 -1.745124 -1.864974 0.952897 -0.182712 -0.009696 -0.097236 0.005497 Syn

In order to plot the PCA, we will keep 2 components.

In [27]:
# plot 
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('PC1', fontsize=15)
ax.set_ylabel('PC2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = ['Syn', 'DrDoS_DNS', 'UDP', 'Benign']
colors = ['r', 'b', 'g', 'y']
for target, color in zip(targets, colors):
    indicesToKeep = df_pca['Label'] == target
    ax.scatter(df_pca.loc[indicesToKeep, 'PC1'], df_pca.loc[indicesToKeep, 'PC2'], c=color, s=50 , alpha=0.5)

ax.legend(targets)
ax.grid()
plt.show()
In [28]:
# plot 
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('PC1', fontsize=15)
ax.set_ylabel('PC2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = ['Syn', 'DrDoS_DNS', 'UDP', 'Benign']
colors = ['r', 'b', 'g', 'y']
for target, color in zip(targets, colors):
    indicesToKeep = df_pca['Label'] == target
    ax.scatter(df_pca.loc[indicesToKeep, 'PC1'], df_pca.loc[indicesToKeep, 'PC2'], c=color, s=50 , alpha=0.5)
for i, txt in enumerate(df.drop('Label', axis=1).columns):
    plt.arrow(0, 0, 200*pca.components_[0][i], 200*pca.components_[1][i], color='black', alpha=0.2, head_width=0.3, width=.1)
    plt.annotate(txt, (200*pca.components_[0][i], 200*pca.components_[1][i]), size=7, alpha=0.7)

ax.legend(targets)
ax.grid()
plt.show()

Kernel PCA¶

In [29]:
# we do the same thing for a kernel PCA
from sklearn.decomposition import KernelPCA
In [30]:
kpca = KernelPCA(n_components=73, kernel='rbf')
principalComponents = kpca.fit_transform(X)

explained_variance = np.var(principalComponents, axis=0)
explained_variance_ratio = explained_variance / np.sum(explained_variance)

# create a dataframe with the principal components
df_kpca = pd.DataFrame(data=principalComponents, columns=['PC' + str(i) for i in range(1, 74)])

# concatenate the labels to the dataframe
df_kpca = pd.concat([df_kpca, df[['Label']]], axis=1)

# print the first 5 rows of the dataframe
df_kpca.head()
Out[30]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC65 PC66 PC67 PC68 PC69 PC70 PC71 PC72 PC73 Label
0 0.694293 -0.163515 -0.390344 -0.117720 -0.268498 -0.149944 0.094854 -0.175512 -0.030103 0.007337 ... -0.025131 -0.001412 -0.035661 0.001363 -0.011641 0.008888 0.000178 0.004417 -0.002615 Syn
1 0.648484 -0.166958 -0.382211 -0.151421 -0.295766 -0.163081 0.129898 -0.162038 -0.012922 0.007717 ... -0.010119 -0.000632 -0.035534 0.001330 0.028063 -0.021355 -0.001086 -0.002838 0.021355 Syn
2 0.598730 0.057647 0.043366 0.223421 0.495209 0.239183 -0.173741 -0.007366 -0.013864 0.025404 ... 0.006306 -0.000919 -0.043462 0.001733 -0.049193 0.004250 0.015516 -0.034627 0.009158 Syn
3 0.681838 -0.041536 -0.119951 0.216507 0.471789 0.255171 -0.282316 0.032216 -0.095738 0.013088 ... -0.048768 -0.002386 -0.060531 0.004326 -0.012547 0.017568 -0.010350 0.029368 0.017143 Syn
4 0.511454 -0.253601 -0.345547 -0.142021 -0.278727 -0.107336 0.028023 0.083457 -0.026670 -0.010007 ... 0.001433 0.000133 -0.000820 0.002191 -0.017745 0.004362 0.001513 0.004507 -0.012166 Syn

5 rows × 74 columns

In [31]:
# plot cumulative explained variance
plt.plot(np.cumsum(explained_variance_ratio))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.title('Explained variance vs number of components')
plt.show()
In [37]:
# check if the file exists
if not os.path.exists('data/results/df_kpca.parquet'):
    # we do the same thing for a kernel PCA
    kpca = KernelPCA(n_components=8, kernel='rbf')
    # create a dataframe with the principal components
    df_kpca = pd.DataFrame(data=kpca.fit_transform(X), columns=['PC' + str(i) for i in range(1, 9)])

    # concatenate the labels to the dataframe
    df_kpca = pd.concat([df_kpca, df[['Label']]], axis=1)

    # save the dataframe
    df_kpca.to_parquet('data/results/df_kpca.parquet')
else:
    df_kpca = pd.read_parquet('data/results/df_kpca.parquet')

# print the first 5 rows of the dataframe
df_kpca.head()
Out[37]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Label
0 0.694293 -0.163515 -0.390344 -0.117720 -0.268498 -0.149944 0.094854 -0.175512 Syn
1 0.648484 -0.166958 -0.382211 -0.151421 -0.295766 -0.163081 0.129898 -0.162038 Syn
2 0.598730 0.057647 0.043366 0.223421 0.495209 0.239183 -0.173741 -0.007366 Syn
3 0.681838 -0.041536 -0.119951 0.216507 0.471789 0.255171 -0.282316 0.032216 Syn
4 0.511454 -0.253601 -0.345547 -0.142021 -0.278727 -0.107336 0.028023 0.083457 Syn
In [38]:
# plot 
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('PC1', fontsize=15)
ax.set_ylabel('PC2', fontsize=15)
ax.set_title('2 component K-PCA', fontsize=20)
targets = ['Syn', 'DrDoS_DNS', 'UDP', 'Benign']
colors = ['r', 'b', 'g', 'y']
for target, color in zip(targets, colors):
    indicesToKeep = df_kpca['Label'] == target
    ax.scatter(df_kpca.loc[indicesToKeep, 'PC1'], df_kpca.loc[indicesToKeep, 'PC2'], c=color, s=50 , alpha=0.5)

ax.legend(targets)
ax.grid()
plt.show()

Both PCA and kernel PCA did a good job of reducing the dimensionality of the dataset, to the point where we can visually see the clusters.

However, it is hard to distinguish the UDP attacks from the DrDoS_DNS attacks.
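One hedged way to put a number on that visual impression is the silhouette score of a 2-D projection against the class labels; the sketch below uses toy blobs rather than the actual PCA/K-PCA embeddings:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy 2-D "embedding" with 4 labeled clusters, standing in for a PCA/K-PCA projection.
X2d, labels = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=0)
score = silhouette_score(X2d, labels)
print(round(score, 3))  # closer to 1 means better-separated clusters
```

Applied to the real projections, a lower score for K-PCA than for PCA would confirm the overlap we see between the UDP and DrDoS_DNS clouds.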

t-SNE¶

In [39]:
from sklearn.manifold import TSNE

# scatter plot the data
def plot_tsne(df_tsne):
    sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne)
    plt.show()
In [40]:
tsne = TSNE(n_components=2, verbose=1, random_state=123, perplexity=5)
df_tsne = pd.DataFrame(tsne.fit_transform(df.drop('Label', axis=1)), columns=['PC1', 'PC2'])
df_tsne['Label'] = df['Label'].values
plot_tsne(df_tsne)
[t-SNE] Computing 16 nearest neighbors...
[t-SNE] Indexed 13300 samples in 0.006s...
[t-SNE] Computed neighbors for 13300 samples in 0.772s...
[t-SNE] Computed conditional probabilities for sample 1000 / 13300
[t-SNE] Computed conditional probabilities for sample 2000 / 13300
[t-SNE] Computed conditional probabilities for sample 3000 / 13300
[t-SNE] Computed conditional probabilities for sample 4000 / 13300
[t-SNE] Computed conditional probabilities for sample 5000 / 13300
[t-SNE] Computed conditional probabilities for sample 6000 / 13300
[t-SNE] Computed conditional probabilities for sample 7000 / 13300
[t-SNE] Computed conditional probabilities for sample 8000 / 13300
[t-SNE] Computed conditional probabilities for sample 9000 / 13300
[t-SNE] Computed conditional probabilities for sample 10000 / 13300
[t-SNE] Computed conditional probabilities for sample 11000 / 13300
[t-SNE] Computed conditional probabilities for sample 12000 / 13300
[t-SNE] Computed conditional probabilities for sample 13000 / 13300
[t-SNE] Computed conditional probabilities for sample 13300 / 13300
[t-SNE] Mean sigma: 0.000000
[t-SNE] KL divergence after 250 iterations with early exaggeration: 84.996506
[t-SNE] KL divergence after 1000 iterations: 1.062197

We can see that the t-SNE algorithm is able to separate the different classes, but the separation is not yet sufficient. We need to adjust the perplexity parameter to obtain a better separation.

According to this article, the optimal perplexity depends on the number of samples in the dataset. As we have around 13k samples, we can try setting the perplexity to 100.
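As a rough sanity check on that choice (this heuristic is our own assumption, not a formula taken from the article), perplexity can be scaled with the sample count and clipped to the range t-SNE usually handles well:

```python
# Heuristic (an assumption): perplexity grows like sqrt(n), clipped to [5, 100].
def suggest_perplexity(n_samples, lo=5.0, hi=100.0):
    return min(hi, max(lo, n_samples ** 0.5))

print(suggest_perplexity(13_300))  # sqrt(13300) ≈ 115, clipped to 100.0
```

For our ~13k rows this lands on the cap of 100, consistent with the value we try below.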

In order to save time and avoid running the t-SNE algorithm for a long time, we will load the results when they have already been computed.

In [41]:
# check if the file 'data/results/df_tsne.parquet' exists
if os.path.isfile('data/results/df_tsne.parquet'):
    df_tsne = pd.read_parquet('data/results/df_tsne.parquet')
else:
    tsne = TSNE(n_components=2, verbose=1, random_state=123, perplexity=100)
    df_tsne = pd.DataFrame(tsne.fit_transform(df.drop('Label', axis=1)), columns=['PC1', 'PC2'])
    df_tsne['Label'] = df['Label'].values
plot_tsne(df_tsne)

The result looks very good with a perplexity of 100: the classes are well separated. The main difficulty is separating the UDP attacks from the benign traffic, and the UDP attacks also overlap with the DrDoS_DNS attacks. The Syn attacks are well separated from the other classes.

Without the Benign traffic, we can see that the clusters are well separated.

In [42]:
df_tsne_no_benign = df_tsne[df_tsne['Label'] != 'Benign']
plot_tsne(df_tsne_no_benign)
In [43]:
# cache the t-SNE results so they do not have to be recomputed
if not os.path.isfile('data/results/df_tsne.parquet'):
    df_tsne.to_parquet('data/results/df_tsne.parquet')
if not os.path.isfile('data/results/df_tsne_no_benign.parquet'):
    df_tsne_no_benign.to_parquet('data/results/df_tsne_no_benign.parquet')

Supervised learning for multiclassification¶

LDA¶

In [44]:
# copy the dataframe to a new one
df_lda = df.copy()
In [45]:
# perform LDA 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_lda.drop('Label', axis=1), df_lda['Label'], test_size=0.2, random_state=42)

# perform LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# make predictions
y_pred = lda.predict(X_test)

# calculate accuracy
accuracy_score(y_test, y_pred)
Out[45]:
0.9733082706766917

The overall accuracy is very high. We can also calculate the precision, recall, and F1 score for each class.

In [46]:
# calculate the accuracy for each class
from sklearn.metrics import precision_score, recall_score, f1_score

print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.98586572 0.93435754 0.99576869 0.98056801]
Recall:  [0.98412698 0.98964497 0.98879552 0.93314367]
F1:  [0.98499559 0.9612069  0.99226985 0.95626822]
In [47]:
# plot the confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[47]:
<Axes: >
In [48]:
from sklearn.metrics import roc_curve, auc
def plot_ROC(X_train, y_train, X_test, y_test, model):
    for cluster in y_test.unique():
        # compute a temp with true/false values
        y_test_temp = y_test == cluster
        y_train_temp = y_train == cluster

        # fit the model
        model.fit(X_train, y_train_temp)

        # predict probabilities
        y_pred_temp = model.predict_proba(X_test)

        # calculate the fpr and tpr for all thresholds of the classification
        fpr, tpr, threshold = roc_curve(y_test_temp, y_pred_temp[:, 1])
        roc_auc = auc(fpr, tpr)
        print('AUC for class {}: {}'.format(cluster, roc_auc))

        # plot the ROC curve
        plt.plot(fpr, tpr, label='{} {} (AUC = {})'.format(cluster, 'traffic' if cluster == 'Benign' else 'attack', round(roc_auc, 3)))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([-0.05, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.show()

plot_ROC(X_train, y_train, X_test, y_test, lda)
AUC for class Syn: 0.9962970799830725
AUC for class UDP: 0.9918445729703562
AUC for class DrDoS_DNS: 0.9895137430807407
AUC for class Benign: 0.9959080870053956

QDA¶

In [49]:
# perform QDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

df_qda = df.copy()

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_qda.drop('Label', axis=1), df_qda['Label'], test_size=0.2, random_state=42)

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# make predictions
y_pred = qda.predict(X_test)

# calculate accuracy
accuracy_score(y_test, y_pred)
Out[49]:
0.9800751879699248
In [50]:
# calculate the accuracy for each class
from sklearn.metrics import precision_score, recall_score, f1_score

print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.98434783 0.95272206 1.         0.98375185]
Recall:  [0.99823633 0.98372781 0.99439776 0.94736842]
F1:  [0.99124343 0.96797671 0.99719101 0.96521739]

Both the overall accuracy and most of the per-class metrics are slightly higher than with LDA.
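Single train/test splits like the ones above can be noisy, so a cross-validated macro-F1 comparison gives a steadier picture. A sketch on synthetic 4-class data (`X_demo`/`y_demo` stand in for the notebook's features):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

# Synthetic 4-class data standing in for the notebook's features.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=8, n_classes=4,
                                     random_state=0)
results = {}
for name, model in [('LDA', LinearDiscriminantAnalysis()),
                    ('QDA', QuadraticDiscriminantAnalysis())]:
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1_macro')
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```

On the real flows, the same loop would tell us whether QDA's edge over LDA survives resampling.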

In [51]:
# plot the confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[51]:
<Axes: >
In [52]:
plot_ROC(X_train, y_train, X_test, y_test, qda)
AUC for class Syn: 0.99719744012713
AUC for class UDP: 0.9330789789870553
AUC for class DrDoS_DNS: 0.9871471774193548
AUC for class Benign: 0.9986504945097077

PCA + LDA¶

In [53]:
# perform LDA on the pca data
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_pca.drop('Label', axis=1), df_pca['Label'], test_size=0.2, random_state=42)

# perform LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# make predictions
y_pred = lda.predict(X_test)

accuracy_PCA_LDA = accuracy_score(y_test, y_pred)

# calculate the accuracy for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.91958042 0.96050269 0.96302251 0.75247525]
Recall:  [0.92768959 0.79142012 0.83893557 0.97297297]
F1:  [0.92361721 0.86780211 0.89670659 0.84863524]
In [54]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[54]:
<Axes: >
In [55]:
plot_ROC(X_train, y_train, X_test, y_test, lda)
AUC for class Syn: 0.9934347839855366
AUC for class UDP: 0.9785538436265919
AUC for class DrDoS_DNS: 0.9809302824966596
AUC for class Benign: 0.9782457018481863

PCA + QDA¶

In [56]:
# perform QDA on the pca data
# perform QDA
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# make predictions
y_pred = qda.predict(X_test)

accuracy_PCA_QDA = accuracy_score(y_test, y_pred)

# calculate the accuracy for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.98951049 0.97191888 1.         0.92702703]
Recall:  [0.99823633 0.92159763 0.99019608 0.97581792]
F1:  [0.99385426 0.9460896  0.99507389 0.95079695]
In [57]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[57]:
<Axes: >
In [58]:
plot_ROC(X_train, y_train, X_test, y_test, qda)
AUC for class Syn: 0.9979686119051938
AUC for class UDP: 0.986067812157692
AUC for class DrDoS_DNS: 0.9914105745371254
AUC for class Benign: 0.994888479360529

K-PCA + LDA¶

In [59]:
# perform LDA on the kpca data
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_kpca.drop('Label', axis=1), df_kpca['Label'], test_size=0.2, random_state=42)

# perform LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# make predictions
y_pred = lda.predict(X_test)

accuracy_KPCA_LDA = accuracy_score(y_test, y_pred)

# calculate the accuracy for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.84090909 0.96028881 0.98666667 0.85603113]
Recall:  [0.97883598 0.78698225 0.93277311 0.93883357]
F1:  [0.90464548 0.86504065 0.95896328 0.89552239]
In [60]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[60]:
<Axes: >

K-PCA + QDA¶

In [61]:
# perform QDA on the kpca data
# perform QDA

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# make predictions
y_pred = qda.predict(X_test)

accuracy_KPCA_QDA = accuracy_score(y_test, y_pred)

# per-class precision, recall and F1-score
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.96428571 0.94971264 0.96433471 0.98962963]
Recall:  [0.95238095 0.97781065 0.98459384 0.95021337]
F1:  [0.95829636 0.96355685 0.97435897 0.96952104]
In [62]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[62]:
<Axes: >

t-SNE + LDA¶

In [63]:
# perform LDA on the t-SNE data
# copy the dataframe to a new one
df_tsne_lda = df_tsne.copy()

# map each label to a number
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df_tsne_lda['Label'])
df_tsne_lda['Label'] = le.transform(df_tsne_lda['Label'])
In [64]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_tsne_lda.drop('Label', axis=1), df_tsne_lda['Label'], test_size=0.2, random_state=42)

# perform LDA   
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# make predictions
y_pred = lda.predict(X_test)

accuracy_tsne_LDA = accuracy_score(y_test, y_pred)

# per-class precision, recall and F1-score
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.68235294 0.83310902 0.81342282 0.74038462]
Recall:  [0.4084507  0.91432792 0.84992987 0.87749288]
F1:  [0.51101322 0.87183099 0.83127572 0.80312907]
In [65]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[65]:
<Axes: >

t-SNE + QDA¶

In [66]:
# perform QDA on the t-SNE data
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# make predictions
y_pred = qda.predict(X_test)

accuracy_tsne_QDA = accuracy_score(y_test, y_pred)

# per-class precision, recall and F1-score
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.77468354 0.88841202 0.84246575 0.76315789]
Recall:  [0.53873239 0.91728213 0.86255259 0.90883191]
F1:  [0.63551402 0.90261628 0.85239085 0.82964889]
In [67]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
Out[67]:
<Axes: >

The accuracy is poor, as expected: t-SNE is a visualization tool rather than a feature-extraction method for classification, and its embedding cannot be applied to unseen data.
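Indeed, scikit-learn's t-SNE exposes no transform step for unseen data, which alone rules it out as a preprocessing stage for a deployed classifier:

```python
from sklearn.manifold import TSNE

# sklearn's TSNE offers fit_transform but no transform() method: the learned
# embedding cannot be applied to new points, so a classifier trained on
# t-SNE coordinates has no principled way to score unseen traffic.
print(hasattr(TSNE, "transform"))  # False
```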

Unsupervised learning for clustering¶

k-Means¶

In [68]:
from sklearn.cluster import KMeans
In [69]:
# elbow method to choose the number of clusters
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=123)
    kmeanModel.fit(df_tsne.drop('Label', axis=1))
    distortions.append(kmeanModel.inertia_)

plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method to find the optimal k')
plt.show()

According to the elbow method, the optimal number of clusters is 4, which is coherent with the number of classes in the dataset.

In [70]:
# create kmeans object
N = 4
kmeans = KMeans(n_clusters=N, random_state=123)

# fit kmeans object to data
kmeans.fit(df_tsne[['PC1', 'PC2']])

# print location of clusters learned by kmeans object
print(kmeans.cluster_centers_)
[[ 46.65798    -9.370605 ]
 [  5.5882826  43.945114 ]
 [-46.13855     3.4791222]
 [ -4.443057  -40.495094 ]]
In [71]:
# plot the data and the clusters learned
df_tsne['kmeans'] = kmeans.labels_
fig = plt.figure(figsize=(10, 10))
plt.subplot(221)
sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne)
plt.title('Actual Labels')
plt.subplot(222)
sns.scatterplot(x='PC1', y='PC2', hue='kmeans', data=df_tsne)
plt.title('KMeans with {} Clusters'.format(N))
plt.show()

We can see that k-means fails to recover the correct clusters. k-means always produces roughly spherical clusters of similar variance, which does not match the shape of our data.
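This limitation is easy to reproduce on synthetic data (assumed data, not our dataset): on two interleaved half-moons, k-means cuts straight through both shapes because it can only form convex, roughly spherical clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-spherical clusters of equal size.
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The adjusted Rand index stays far below the perfect score of 1.0.
print(adjusted_rand_score(y, pred))
```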

We will try to use the k-means algorithm with the t-SNE results without the Benign traffic.

In [72]:
# create a plot grid of 2x2
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# plot the pie charts of distributions of labels for each cluster
for i in range(4):
    df_tsne[df_tsne['kmeans'] == i]['Label'].value_counts().plot.pie(ax=ax[i//2][i%2], autopct='%.2f', fontsize=12)
    ax[i//2][i%2].set_title('Cluster {}'.format(i))
plt.show()

We can see that three of the four clusters consist almost entirely of a single attack class, but separating the Benign traffic remains difficult, as it is spread across all four clusters.

In [73]:
# evaluate the performance of the clustering using the silhouette score
from sklearn.metrics import silhouette_score
print(silhouette_score(df_tsne[['PC1', 'PC2']], kmeans.labels_))

# plot the silhouette for the various clusters
from yellowbrick.cluster import SilhouetteVisualizer
visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick')
visualizer.fit(df_tsne[['PC1', 'PC2']])
visualizer.show()
0.42762336
Out[73]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 13300 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

The silhouette plot is a graphical tool showing how well each data point fits into its assigned cluster compared with the neighbouring clusters. The silhouette coefficient measures cluster cohesion and separation.
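As a point of comparison for our score of roughly 0.43, two tight, well-separated synthetic blobs (assumed data, not the notebook's) score close to 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, clearly separated blobs: a near-ideal clustering scenario.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 0.3, (100, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette score well above 0.8, versus ~0.43 on our overlapping clusters.
print(silhouette_score(X, labels) > 0.8)  # True
```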

GMM¶

In [74]:
from sklearn.mixture import GaussianMixture
In [75]:
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=123)
gmm.fit(df_tsne[['PC1', 'PC2']])
df_tsne['gmm'] = gmm.predict(df_tsne[['PC1', 'PC2']])
In [76]:
# plot the results
fig = plt.figure(figsize=(10, 10))
plt.subplot(221)
sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne, palette='Set1')
plt.title('Original labels')
plt.subplot(222)
sns.scatterplot(x='PC1', y='PC2', hue='gmm', data=df_tsne)
plt.title('GMM with {} components'.format(df_tsne['gmm'].nunique()))
plt.show()
In [77]:
# create a plot grid of 2x2
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# plot the pie charts of distributions of labels for each cluster
for i in range(4):
    df_tsne[df_tsne['gmm'] == i]['Label'].value_counts().plot.pie(ax=ax[i//2][i%2], autopct='%.2f', fontsize=12)
    ax[i//2][i%2].set_title('Cluster {}'.format(i))
plt.show()
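We fixed the number of mixture components to 4 because we know the number of classes. When that count is unknown, a common model-selection criterion is the BIC; the sketch below illustrates the idea on synthetic blobs (assumed data, not our t-SNE results).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Fit GMMs with 1..7 components and keep the BIC of each fit.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 8)]

# The BIC typically bottoms out near the true number of components.
print(int(np.argmin(bics)) + 1)
```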

DBSCAN¶

In [78]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.7, min_samples= 15).fit(df_tsne[['PC1', 'PC2']])
df_tsne['dbscan'] = dbscan.labels_
In [79]:
# plot the data and the clusters learned
fig = plt.figure(figsize=(10, 10))
plt.subplot(221)
sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne)
plt.title('Actual Labels')
plt.subplot(222)
sns.scatterplot(x='PC1', y='PC2', hue='dbscan', data=df_tsne)
plt.title('DBSCAN')
plt.show()

We can see that the DBSCAN algorithm creates more than 100 clusters, which is not what we want. We need to tune the eps parameter in order to obtain a realistic number of clusters.
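Besides sweeping a grid of eps values, a common heuristic is the k-distance curve: sort every point's distance to its k-th nearest neighbour (with k = min_samples) and read eps off the "knee" of the curve. A sketch on synthetic blobs (assumed data, not our t-SNE results):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Distance of each point to its k-th nearest neighbour, sorted ascending.
k = 15
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

# Plotting k_dist shows a flat region then a sharp rise: eps ~ the knee.
print(k_dist[0] <= k_dist[-1])  # True
```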

In [80]:
EPS = [0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
fig, ax = plt.subplots(4, 2, figsize=(20, 20))
for eps in EPS:
    dbscan = DBSCAN(eps=eps, min_samples= 20).fit(df_tsne[['PC1', 'PC2']])
    df_tsne['dbscan'] = dbscan.labels_
    # plot the data and the clusters learned
    sns.scatterplot(x='PC1', y='PC2', hue='dbscan', data=df_tsne, ax=ax[EPS.index(eps)//2][EPS.index(eps)%2])
    ax[EPS.index(eps)//2][EPS.index(eps)%2].set_title('DBSCAN with eps={}'.format(eps))
plt.show()

Hierarchical clustering¶

In [81]:
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
In [82]:
def euclidean_distance(a, b):
    # Euclidean distance between two points
    return np.sqrt(np.sum((a - b) ** 2))

For the hierarchical clustering, we will first use a subsample of the dataset in order to have a better visualization of the dendrogram.

In [83]:
samples = df_tsne.sample(100)
In [84]:
# compute the pairwise distance matrix between all the points
pts = samples[['PC1', 'PC2']].to_numpy()
n = pts.shape[0]
distance_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        distance_matrix[i, j] = euclidean_distance(pts[i], pts[j])

sns.heatmap(distance_matrix)
Out[84]:
<Axes: >
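As a side note, the nested Python loop used to build the distance matrix is O(n²) interpreted code; SciPy's vectorised pairwise-distance routines produce the same matrix much faster. A minimal check on synthetic points:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.random.default_rng(0).normal(size=(50, 2))

# Manual broadcasting computation of all pairwise Euclidean distances.
manual = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))

# scipy's condensed pairwise distances, expanded to a square matrix.
fast = squareform(pdist(pts, metric='euclidean'))

print(np.allclose(manual, fast))  # True
```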
In [85]:
# compute the linkage matrix
Z = linkage(samples[['PC1', 'PC2']], method='average', metric='euclidean')

# plot the dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z, leaf_rotation=90., leaf_font_size=8., labels=samples['Label'].values)
plt.show()

We can see that hierarchical clustering separates the different types of attacks; however, the Benign traffic is mixed among the clusters.

We can try to run the hierarchical clustering algorithm after separating the Benign traffic from the attacks with the LDA model.
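A minimal sketch of this two-stage idea, on synthetic stand-in data (the column names mirror the notebook's df_tsne; everything else here is assumed): a binary LDA screens out Benign flows, then hierarchical clustering runs on the predicted-attack subset.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import cut_tree, linkage
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the t-SNE dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'PC1': rng.normal(size=200),
    'PC2': rng.normal(size=200),
    'Label': rng.choice(['Benign', 'Syn', 'UDP', 'DrDoS_DNS'], 200),
})

# Stage 1: binary LDA, Benign vs. attack.
lda = LinearDiscriminantAnalysis().fit(df[['PC1', 'PC2']],
                                       df['Label'] == 'Benign')
attacks = df[~lda.predict(df[['PC1', 'PC2']])]

# Stage 2: hierarchical clustering of the predicted-attack flows only.
Z = linkage(attacks[['PC1', 'PC2']], method='average', metric='euclidean')
clusters = cut_tree(Z, n_clusters=3).ravel()
print(len(np.unique(clusters)))  # 3
```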

In [86]:
samples_no_benign = df_tsne_no_benign.sample(100)
In [87]:
Z_no_benign = linkage(samples_no_benign[['PC1', 'PC2']], method='average', metric='euclidean')

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z_no_benign, leaf_rotation=90., leaf_font_size=8., labels=samples_no_benign['Label'].values)
plt.show()

We can see that without the Benign class, hierarchical clustering with the Euclidean distance performs very well on the three classes: three clusters (orange, green and red) clearly correspond to one class each. However, one large cluster still mixes UDP, DrDoS_DNS and a little Syn traffic.

In [88]:
Z_full = linkage(df_tsne_no_benign[['PC1', 'PC2']], method='average', metric='euclidean')

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z_full, leaf_rotation=90., leaf_font_size=8., no_labels=True)
plt.show()

We can see that we have 4 main clusters (orange, green, red and blue) whereas we should get only 3 classes. Let's see the proportions of each class in each cluster.

In [89]:
df_tsne_no_benign['hierarchical'] = cut_tree(Z_full, n_clusters=4)
In [90]:
# plot the pie charts of distributions of labels for each cluster
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for i in range(4):
    df_tsne_no_benign[df_tsne_no_benign['hierarchical'] == i]['Label'].value_counts().plot.pie(ax=ax[i // 2, i % 2], title='Cluster {}'.format(i), autopct='%.2f')
plt.show()

We can see that three clusters are almost entirely composed of a single class. The DrDoS_DNS and UDP attacks cannot be fully separated, as they end up mixed in the same cluster, while the Syn attacks are well separated from the other classes.

In [91]:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for cluster in range(0, df_tsne_no_benign['hierarchical'].nunique()):
    sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne_no_benign[df_tsne_no_benign['hierarchical'] == cluster], ax=ax[cluster // 2, cluster % 2])
    ax[cluster // 2, cluster % 2].set_title('Cluster {}'.format(cluster))
# set the title of the plot
plt.suptitle('Hierarchical Clustering')
plt.show()

Other approaches¶

In [92]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Label', axis=1), df['Label'], test_size=0.2, random_state=42)
from sklearn.tree import DecisionTreeClassifier

Decision tree¶

In [93]:
# train a decision tree classifier on the training set
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# predict the labels of the test set
y_pred = clf.predict(X_test)

# compute the accuracy of the predictions
accuracy_score(y_test, y_pred)
Out[93]:
0.9853383458646616
In [94]:
# plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
Out[94]:
<Axes: >

K-Nearest Neighbors¶

In [95]:
# train a knn classifier on the training set
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

# predict the labels of the test set
y_pred = clf.predict(X_test)

# compute the accuracy of the predictions
accuracy_score(y_test, y_pred)
Out[95]:
0.9635338345864661
In [96]:
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
Out[96]:
<Axes: >
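One caveat worth noting for KNN: it is distance-based, so unscaled features with large magnitudes (byte counts, durations) dominate the metric. A synthetic sketch (assumed data, not the notebook's pipeline) of why standardising first usually helps:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One informative small-scale feature plus one noisy huge-scale feature.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
X = np.column_stack([y + rng.normal(0, 0.3, 400),    # informative, scale ~1
                     rng.normal(0, 1000, 400)])      # pure noise, scale ~1000

raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5).mean()

# Without scaling, distances are dominated by the noise feature.
print(scaled > raw)  # True
```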

Random Forest¶

In [97]:
# train a random forest classifier on the training set
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=123)
clf.fit(X_train, y_train)

# predict the labels of the test set
y_pred = clf.predict(X_test)

# compute the accuracy of the predictions
accuracy_score(y_test, y_pred)
Out[97]:
0.9898496240601504
In [98]:
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
Out[98]:
<Axes: >

Conclusion¶

In [99]:
# accuracies recorded from previous runs of each model
accuracy_LDA = 0.9763246899661782
accuracy_QDA = 0.9676813228109733
accuracy_DT = 0.9898496240601504
accuracy_KNN = 0.9616541353383459
accuracy_RF = 0.9902255639097745

x=['LDA', 'QDA', 'PCA/LDA', 'PCA/QDA', 'KPCA/LDA', 'KPCA/QDA', 't-SNE/LDA', 't-SNE/QDA', 'DT', 'KNN', 'RF']
y=[accuracy_LDA, accuracy_QDA, accuracy_PCA_LDA, accuracy_PCA_QDA, accuracy_KPCA_LDA, accuracy_KPCA_QDA, accuracy_tsne_LDA, accuracy_tsne_QDA, accuracy_DT, accuracy_KNN, accuracy_RF]

accuracies = {
    label: accuracy for label, accuracy in zip(x, y)
}
accuracies = {k: v for k, v in sorted(accuracies.items(), key=lambda item: item[1])}

plt.figure(figsize=(15, 7))
sns.barplot(x=list(accuracies.keys()), y=list(accuracies.values()))
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.ylim(0.75, 1)
plt.show()

According to our study, we can draw several conclusions:

  • Dimensionality reduction with PCA has not been very effective: PCA is a linear algorithm and cannot separate classes that are not linearly separable. The Kernel PCA and t-SNE algorithms, which are non-linear, were more effective at separating the classes.
  • The supervised machine-learning algorithms we studied in class classified the attack types very well on the raw data (without dimensionality reduction).
  • When we applied the supervised algorithms to the dimensionality-reduced data, the results varied widely. PCA/LDA and PCA/QDA were not as good as plain LDA and QDA. Kernel PCA slightly decreased the accuracy of the QDA model and decreased the accuracy of the LDA model more noticeably. Finally, t-SNE decreased the accuracy of both LDA and QDA, which is expected: t-SNE is a visualization tool and is not meant to be used for classification.
  • The unsupervised algorithms gave very different results. The k-means algorithm could not separate the classes, since its clusters are spherical and the data does not follow that shape. The GMM algorithm separated the classes, though not as well as hierarchical clustering. DBSCAN, with its initial parameters, produced more than 100 clusters and failed to separate the classes. Finally, the hierarchical clustering algorithm separated the classes very well.
  • Among the other supervised algorithms we tried, K-Nearest Neighbors gave average results, while Decision Tree and Random Forest gave very good ones. Tree-based models are well suited to this kind of study: they handle non-linear data and classify it through a hierarchical, rule-based approach.
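To illustrate this last point, a small synthetic comparison (assumed data, not CIC-DDoS2019) of a tree ensemble against a linear model on a non-linear decision boundary:

```python
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Two interleaved half-moons: a curved boundary a linear model cannot fit.
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)

rf = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5).mean()
lda = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()

# The forest's axis-aligned splits approximate the curve; LDA cannot.
print(rf > lda)  # True
```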